Table 1: AURORA-2 digit recognition accuracy (%).

Abstract


Using the AURORA-2 digit recognition task, we show that
recognition accuracies obtained with classical, SNR-based
oracle masks can be substantially improved by using a
state-dependent mask estimation technique.
Index Terms: Noise Robust ASR, Missing Data Techniques 

1. Introduction 

In Missing Data Techniques (MDT) for noise robust automatic
speech recognition (ASR), it is often implicitly assumed that
using an SNR-based oracle mask¹ guarantees maximum
recognition accuracy. Generally speaking, however, this is not
necessarily true.

In previous work [1], the authors showed that the portions
of an oracle mask which are important for recognition accuracy
are speech sound dependent. In this paper we exploit this
finding by a state-dependent treatment of reliable features.
Using a different mask estimator for every HMM state and
selecting the mask estimator for each time frame based on an
externally provided state transcription, we generate a new type
of oracle mask. We then compare the recognition accuracies
obtained with state-dependent oracle masks and classical
oracle masks on the AURORA-2 digit recognition task.
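
The per-frame selection of a mask estimator described above can be sketched as follows. This is a minimal illustration rather than the recognizer's actual code: the names `state_dependent_mask`, `Frame` and `Estimator` are hypothetical, and each estimator is assumed to be a callable that returns one binary reliability decision for one frequency band.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical types: a frame is a list of feature values, and an estimator
# maps one frame to a binary reliability decision for a single frequency band.
Frame = Sequence[float]
Estimator = Callable[[Frame], bool]


def state_dependent_mask(
    frames: Sequence[Frame],
    state_sequence: Sequence[int],
    estimators: Dict[int, List[Estimator]],
) -> List[List[bool]]:
    """Build a mask with one row per frame and one column per band.

    For every frame, the estimators that belong to the state given by the
    externally provided transcription decide which bands are reliable.
    """
    if len(frames) != len(state_sequence):
        raise ValueError("need exactly one state label per frame")
    mask = []
    for frame, state in zip(frames, state_sequence):
        band_estimators = estimators[state]  # selected by the transcription
        mask.append([estimate(frame) for estimate in band_estimators])
    return mask


# Toy usage with two states and two bands (all values are made up).
toy_estimators = {
    0: [lambda f: f[0] > 0.5, lambda f: f[1] > 0.5],
    1: [lambda f: f[0] > 0.1, lambda f: f[1] > 0.9],
}
toy_frames = [[0.3, 0.7], [0.3, 0.7]]
toy_mask = state_dependent_mask(toy_frames, [0, 1], toy_estimators)
```

In the toy usage the two frames are identical, yet they receive different masks because the transcription assigns them to different states.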

2. Experiments and results 

Experiments on test set A of the AURORA-2 digit corpus were
carried out using a MATLAB HMM-based MDT recognizer in
which the masks for delta coefficients were computed as the
deltas of the static masks (cf. [2] for implementation and model
details). Our state-dependent mask estimator was based on
binary SVM classifiers using LIBSVM.

We trained separate SVM models for all S = 179 HMM
states and all K = 23 Mel frequency bands, resulting in
S × K = 4117 models. The frame-based SVM features we used
consisted of 'Subband Energy to Subband Noise Floor Ratio'
and 'Flatness' as in [3], the harmonic and random components
of the noisy speech signal [4] and the noisy speech acoustic
vectors. Reliability labels used in training were obtained from the
(classical) oracle mask. Every state-specific SVM mask estimator
was trained on the frames from the multi-condition train set
which were assigned to the same state by a forced alignment of
the corpus utterances with the reference transcription. All 4117
models were trained with the same SVM feature vector.
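
The bookkeeping behind the S × K models can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function `group_training_data` and its argument layout are hypothetical, and the SVM training call itself is left out. Each resulting (features, labels) pair would be handed to a binary SVM trainer such as LIBSVM.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

N_STATES = 179  # S, number of HMM states (from the text)
N_BANDS = 23    # K, number of Mel frequency bands (from the text)


def group_training_data(
    features: Sequence[Sequence[float]],
    alignment: Sequence[int],
    oracle_mask: Sequence[Sequence[int]],
) -> Dict[Tuple[int, int], Tuple[List[Sequence[float]], List[int]]]:
    """Split frames into one (features, labels) set per (state, band).

    features[t]    : SVM feature vector of frame t (shared by all models)
    alignment[t]   : HMM state assigned to frame t by forced alignment
    oracle_mask[t] : reliability labels (1 = reliable), one per band
    """
    sets = defaultdict(lambda: ([], []))
    for feat, state, labels in zip(features, alignment, oracle_mask):
        for band, label in enumerate(labels):
            x, y = sets[(state, band)]
            x.append(feat)   # same feature vector for every band
            y.append(label)  # band-specific label from the oracle mask
    return dict(sets)


# Toy usage: three frames, two bands, two distinct states (made-up values).
toy_sets = group_training_data(
    features=[[0.1], [0.2], [0.3]],
    alignment=[5, 5, 7],
    oracle_mask=[[1, 0], [0, 0], [1, 1]],
)
```

With the full state inventory and all bands, this grouping yields at most `N_STATES * N_BANDS` training sets, one per model.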

Table 1 shows the recognition accuracies for the classical
and the state-dependent oracle mask, as well as the accuracy
gain.

1 These oracle masks are computed by comparing spectro-temporal 
representations of the underlying speech and noise signals. Features 
dominated by speech energy are dubbed reliable; features dominated 
by noise energy unreliable. 
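
The oracle mask definition in the footnote can be sketched as follows. This is a minimal illustration, assuming that the speech and noise representations are available as per-frame, per-band energies; the function name `snr_oracle_mask` and the 0 dB default threshold are assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence


def snr_oracle_mask(
    speech_energy: Sequence[Sequence[float]],
    noise_energy: Sequence[Sequence[float]],
    threshold_db: float = 0.0,  # assumed threshold, not from the paper
) -> List[List[bool]]:
    """Label a feature reliable when its local SNR exceeds the threshold."""
    mask = []
    for speech_frame, noise_frame in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(speech_frame, noise_frame):
            if n <= 0.0:
                row.append(s > 0.0)  # no noise: reliable if speech present
            elif s <= 0.0:
                row.append(False)    # no speech energy: unreliable
            else:
                snr_db = 10.0 * math.log10(s / n)
                row.append(snr_db > threshold_db)
        mask.append(row)
    return mask


# Toy usage: two frames, two bands (made-up energies).
toy_oracle = snr_oracle_mask(
    speech_energy=[[4.0, 1.0], [0.5, 2.0]],
    noise_energy=[[1.0, 2.0], [0.5, 0.0]],
)
```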
Clearly, the state-dependent oracle mask performs consistently better than
the classical oracle mask, with larger gains in more adverse conditions.

3. Discussion and Conclusions 

The classical, SNR-based oracle mask only describes which static
coefficients are reliable. Since the treatment of dynamic features is
specific to the missing data decoder, the classical oracle mask is not
necessarily the 'ideal' mask. Detailed analysis revealed that
state-dependent masks contain fewer isolated reliable elements than
classical ones. In our setup (but also in those of others, e.g., 0) coarser
granularity is beneficial for recognition performance, because isolated
reliable mask coefficients can result in delta mask coefficients that are
mistakenly labeled reliable.
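
One way to quantify the number of isolated reliable elements in a mask is sketched below. The definition of 'isolated' used here (a reliable element with no reliable neighbour in the adjacent frames or adjacent bands) is an assumption made for illustration; the analysis in this paper may have used a different criterion, and `count_isolated_reliable` is a hypothetical helper.

```python
from typing import Sequence


def count_isolated_reliable(mask: Sequence[Sequence[int]]) -> int:
    """Count reliable elements without a reliable 4-neighbour.

    mask[t][k] is truthy when band k of frame t is labeled reliable.
    """
    n_frames = len(mask)
    count = 0
    for t, row in enumerate(mask):
        for k, reliable in enumerate(row):
            if not reliable:
                continue
            # Neighbours in time (t - 1, t + 1) and frequency (k - 1, k + 1).
            neighbours = [(t - 1, k), (t + 1, k), (t, k - 1), (t, k + 1)]
            has_reliable_neighbour = any(
                0 <= i < n_frames and 0 <= j < len(mask[i]) and mask[i][j]
                for i, j in neighbours
            )
            if not has_reliable_neighbour:
                count += 1
    return count


# Toy usage: three frames, three bands. Only the element at frame 0,
# band 0 has no reliable neighbour.
toy_isolated = count_isolated_reliable([
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 1],
])
```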

Our findings might also be useful for speech decoding without a priori
information about the state sequence. Oracle recognition accuracy
would theoretically come within reach if, for each frame, one can afford
to evaluate as many mask vectors as there are states (i.e., 179 in the
case of AURORA-2).

This is a significant reduction in complexity compared to the
2^23 = 8,388,608 theoretically possible masks, and it comes without the
loss of accuracy reported in [5]. For small vocabulary tasks, today's
computing power might even make a brute force approach feasible.
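
The size of this reduction follows directly from the numbers in the text: a binary mask vector over K = 23 bands can take 2^23 values per frame, whereas the state-dependent approach needs one candidate mask vector per state. A short worked check (variable names are illustrative only):

```python
N_STATES = 179  # number of HMM states in the AURORA-2 setup (from the text)
N_BANDS = 23    # number of Mel frequency bands (from the text)

# Every band is either reliable or unreliable, so an exhaustive search
# would consider 2^K mask vectors per frame.
masks_exhaustive = 2 ** N_BANDS

# The state-dependent approach evaluates one mask vector per state.
masks_state_dependent = N_STATES

# Whole-number factor by which the per-frame search space shrinks.
reduction_factor = masks_exhaustive // masks_state_dependent
```

The per-frame search space thus shrinks by more than four orders of magnitude.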

Our future research, however, will focus on a further reduction of 
computational complexity by exploiting state transition constraints. 